01 Workflow Introduction ¶
Revealing your data (nearly) effortlessly, at every step in your workflow
Workflow from data to decision ¶
If there's no visualization at any of these stages, you're flying blind.
But visualization is often skipped as too hard to construct, particularly for big data.
What if it were simple to visualize anything, anywhere?
Good news / bad news: lots of choices!
Too hard to try them all, learn them all, or get them to work together.
PyViz: Seamless interoperability for browser-based viz tools
Supported by Anaconda, Inc.
import numpy as np
import pandas as pd
import holoviews as hv
import datashader.transfer_functions as tf
%matplotlib inline
hv.extension('bokeh', 'matplotlib', width="100")
%opts Curve [width=600 height=250 tools=['hover'] ] {+framewise} VLine (color="black")
%opts Bars [width=800 height=400 tools=['hover'] group_index=1 legend_position='top_left' xrotation=90]
Exploring Pandas Dataframes ¶
If your data is in a Pandas dataframe, it's natural to explore it using the .plot() method (based on Matplotlib). Let's look at a dataset of the number of cases of measles and pertussis (per 100,000 people) over time in each state:
df = pd.read_csv('../data/diseases.csv.gz')
df.head()
Just calling .plot() won't give anything meaningful, because it doesn't know what should be plotted against what:
df.plot();
But with some Pandas operations we can pull out parts of the data that make sense to plot:
import numpy as np
measles_by_year = df[["Year","measles"]].groupby("Year").aggregate(np.sum)
measles_by_year.plot();
Here it is easy to see that the 1963 introduction of a measles vaccine brought the cases down to negligible levels.
By default, the tools below ignore the Pandas index, so we'll make it into a real column for the rest of this notebook:
measles_by_year = measles_by_year.reset_index()
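To make the group-and-sum step and the index reset concrete, here is a minimal sketch on a tiny made-up table. The `toy` dataframe and its numbers are illustrative assumptions, not values from the diseases dataset:

```python
import pandas as pd

# Hypothetical toy data standing in for the real disease table.
toy = pd.DataFrame({
    "Year":    [1960, 1960, 1961, 1961],
    "State":   ["Texas", "Ohio", "Texas", "Ohio"],
    "measles": [10.0, 20.0, 5.0, 7.0],
})

# Sum the per-state values within each year; "Year" becomes the index.
toy_by_year = toy[["Year", "measles"]].groupby("Year").sum()

# reset_index() turns the "Year" index back into an ordinary column,
# which is the form the later plotting steps expect.
toy_by_year = toy_by_year.reset_index()
print(toy_by_year)
```

After the reset, `Year` and `measles` are both plain columns, so a tool that ignores the index still sees the year values.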
Exploring Data with HoloViews and Bokeh ¶
The above plots are just static images, but it can be just as simple to get fully interactive plots in a web browser, with hover, pan, and zoom, by using HoloViews to get a Bokeh plot:
hv.Curve(measles_by_year)
m = hv.Curve(measles_by_year) * hv.VLine(1963) * \
hv.Text(1963, 27000, " Vaccine introduced", halign='left')
m
HoloViews objects like these can be overlaid and annotated while still always giving you access to the original data involved for further analysis:
print(m)
m.Curve.I.data.head()
With other plotting libraries, each plot you make will be a dead end, discouraging you from investing in it, but HoloViews objects preserve the full data throughout plotting, slicing, sampling, and other operations.
It's also easy to break down the data in different ways, such as to look at each state individually:
ds = hv.Dataset(df, ['Year', 'State'], 'measles').aggregate(function=np.sum)
measles_by_state = ds.to(hv.Curve, 'Year', 'measles')
measles_by_state * hv.VLine(1963)
Or pull out a couple of those to put side by side:
measles_by_state["Texas"] + measles_by_state["New York"]
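For readers more familiar with plain Pandas, the per-state breakdown can be approximated as a two-key groupby followed by a filter. This is only a rough analogue on invented data (the `toy`, `per_state`, and `texas` names are hypothetical), not a description of what HoloViews does internally:

```python
import pandas as pd

# Hypothetical repeated records per year and state; values are invented.
toy = pd.DataFrame({
    "Year":    [1960, 1960, 1960, 1961, 1961, 1961],
    "State":   ["Texas", "Texas", "Ohio", "Texas", "Ohio", "Ohio"],
    "measles": [1.0, 2.0, 4.0, 3.0, 5.0, 6.0],
})

# One row per (Year, State) pair, summing the repeated records.
per_state = toy.groupby(["Year", "State"], as_index=False)["measles"].sum()

# Pull out a single state's yearly series, similar in spirit to
# indexing the per-state HoloViews object by state name.
texas = per_state[per_state["State"] == "Texas"].reset_index(drop=True)
print(texas)
```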
Or to compare four states over time by overlaying:
states = ['New York', 'New Jersey', 'California', 'Texas']
measles_by_state.select(State=states, Year=(1930, 2005)).overlay() * hv.VLine(1963)
Or by faceting:
%%opts Curve [width=200, height=100]
measles_by_state.select(State=states, Year=(1930, 2005)).grid('State') * hv.VLine(1963)
Or as Bars or many other types of plots:
ds.select(State=states, Year=(1980, 1990)).to(hv.Bars, ['Year', 'State'], 'measles').sort()
Or with error bars:
agg = ds.aggregate('Year', function=np.mean, spreadfn=np.std)
(hv.Curve(agg) * hv.ErrorBars(agg,vdims=['measles', 'measles_std'])).redim.range(measles=(0, None)) * hv.VLine(1963)
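The error bars above come from a per-year mean plus a spread computed with np.std. A minimal Pandas sketch of the same kind of summary, on made-up numbers, is shown below. The `measles_std` column name follows the cell above; the data and the `toy` and `stats` names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical values: two records per year.
toy = pd.DataFrame({
    "Year":    [1960, 1960, 1961, 1961],
    "measles": [10.0, 20.0, 4.0, 8.0],
})

grouped = toy.groupby("Year")["measles"]

# Per-year mean as the central value.
stats = grouped.mean().to_frame("measles")

# np.std defaults to the population formula (ddof=0), whereas pandas'
# std defaults to ddof=1, so ddof=0 is passed explicitly to match NumPy.
stats["measles_std"] = grouped.std(ddof=0)

stats = stats.reset_index()
print(stats)
```

The ddof difference matters mostly for small groups, where the two conventions give visibly different error bars.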
If we really want to invest a lot of time in making a fancy plot, we can customize it to try to show all the yearly data about measles at once:
url = 'https://raw.githubusercontent.com/blmoore/blogR/master/data/measles_incidence.csv'
data = pd.read_csv(url, skiprows=2, na_values='-')
yearly_data = data.drop('WEEK', axis=1).groupby('YEAR').sum().reset_index()
measles = pd.melt(yearly_data, id_vars=['YEAR'], var_name='State', value_name='Incidence')
heatmap = hv.HeatMap(measles, label='Measles Incidence')
aggregate = hv.Dataset(heatmap).aggregate('YEAR', np.mean, np.std)
marker = hv.Text(1963, 800, u'\u2193 Vaccine introduced', halign='left')
agg = hv.ErrorBars(aggregate) * hv.Curve(aggregate).opts(plot=dict(xrotation=90))
hm_opts = dict(width=900, height=500, tools=['hover'], logz=True, invert_yaxis=True,
xrotation=90, labelled=[], toolbar='above', xaxis=None)
overlay_opts = dict(width=900, height=200, show_title=False)
vline_opts = dict(line_color='black')
opts = {'HeatMap': {'plot': hm_opts},
'Overlay': {'plot': overlay_opts},
'VLine': {'style': vline_opts}}
(heatmap + agg * marker).opts(opts).cols(1)
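The pd.melt call in the cell above reshapes a wide table (one column per state) into the long form that a heatmap over year and state needs. A small sketch with invented values shows the shape change; the `wide` and `long` names and the numbers are hypothetical:

```python
import pandas as pd

# Hypothetical wide table: one column per state (values invented).
wide = pd.DataFrame({
    "YEAR":  [1960, 1961],
    "TEXAS": [10.0, 5.0],
    "OHIO":  [20.0, 7.0],
})

# melt() stacks the state columns into (State, Incidence) pairs,
# repeating the YEAR value for each state.
long = pd.melt(wide, id_vars=["YEAR"], var_name="State", value_name="Incidence")
print(long)
```

Each row of the long table is one (YEAR, State) cell with its Incidence value.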
By the way, the only thing about any of this that's specific to Bokeh is being able to interact with elements of the plot; HoloViews can use Matplotlib instead of Bokeh to generate any of the plots if we don't need zoom, hover, etc.:
%%output backend='matplotlib'
measles_by_state * hv.VLine(1963) * hv.Text(1963, 1000, " Vaccine introduced", halign='left')
As you can see, there are lots of options for getting quick plots to explore your data in a browser, and if you choose HoloViews+Bokeh plots, you can have full interactivity with very little code to explore even quite complex datasets.
Interactive statistical plots ¶
For high-dimensional datasets with additional data variables, we can compose all the above faceting methods as needed.
For instance, let's look at the Iris dataset:
from holoviews.operation import gridmatrix
from bokeh.sampledata.iris import flowers as iris
iris.tail()
We can look at all these relationships at once, interactively:
%%opts Bivariate [bandwidth=0.5] (cmap=Cycle(values=['Blues', 'Reds', 'Oranges']))
%%opts Points [tools=['box_select','lasso_select']] (size=2 alpha=0.7) NdOverlay [batched=False]
iris_ds = hv.Dataset(iris).groupby('species').overlay()
density_grid = gridmatrix(iris_ds, diagonal_type=hv.Distribution, chart_type=hv.Bivariate)
point_grid = gridmatrix(iris_ds, chart_type=hv.Points)
density_grid * point_grid
Dealing with large data and geo data ¶
PyViz is a modular suite of tools, and when you need capabilities not handled by Bokeh and HoloViews as above, you can bring those in:
- GeoViews: Visualizable geographic HoloViews objects
- Datashader: Rasterizing huge HoloViews objects to images quickly
- Param: Declaring user-relevant parameters, making it simple to work with widgets inside and outside of a notebook context
- Colorcet: Perceptually uniform colormaps for big data
Let's look at a large(ish) dataset of 10 million taxi trips on a map.
import holoviews as hv, geoviews as gv, dask.dataframe as dd, cartopy.crs as crs
from colorcet import fire
from holoviews.operation.datashader import datashade
df = dd.read_parquet('../data/nyc_taxi_wide.parq').persist()
options = dict(width=700, height=600, xaxis=None, yaxis=None, bgcolor='black')
points = hv.Points(df, ['pickup_x', 'pickup_y'])
taxi_trips = datashade(points, x_sampling=0.5, y_sampling=0.5, cmap=fire).opts(plot=options)
url = 'https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{Z}/{Y}/{X}.jpg'
tiles = gv.WMTS(url, crs=crs.GOOGLE_MERCATOR)
tiles * taxi_trips
As you can see, you can specify geo plots easily with GeoViews, and if your HoloViews objects are too big to visualize in a browser directly, you can add datashade() to render them into images dynamically on zooming, etc.
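The core idea behind rendering a huge point set as an image can be sketched with NumPy alone: bin the points into a fixed-size grid, so that only the grid needs to be displayed. This is a conceptual analogue under simplifying assumptions (random synthetic points, a plain 2D histogram), not Datashader's actual pipeline:

```python
import numpy as np

# Synthetic points standing in for a large dataset (assumption).
rng = np.random.default_rng(0)
n_points = 100_000
x = rng.normal(size=n_points)
y = rng.normal(size=n_points)

# Bin the points into 70 columns (x) by 60 rows (y); each cell holds a count.
# Points outside the given range are dropped.
counts, x_edges, y_edges = np.histogram2d(
    x, y, bins=(70, 60), range=[[-4, 4], [-4, 4]]
)
image = counts.T  # rows correspond to y, columns to x

# Only this small array would need to be drawn, however many points
# went into it.
print(image.shape, int(image.sum()))
```

Re-running the binning with a narrower range is the analogue of re-rendering after a zoom: the grid size stays fixed while the covered region shrinks.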
You can also easily add widgets to control filtering, selection, and other options interactively, either here in the notebook or in a standalone server:
import param, parambokeh
from colorcet import cm_n
from holoviews.streams import RangeXY
url='https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{Z}/{Y}/{X}.jpg'
tiles = gv.WMTS(url,crs=crs.GOOGLE_MERCATOR)
opts = dict(width=1000,height=600,xaxis=None,yaxis=None,bgcolor='black',show_grid=False)
class NYCTaxiExplorer(hv.streams.Stream):
    alpha = param.Magnitude(default=0.75, doc="Alpha value for the map opacity")
    colormap = param.ObjectSelector(default=cm_n["fire"], objects=cm_n.values())
    location = param.ObjectSelector(default='dropoff', objects=['dropoff', 'pickup'])
    def make_view(self, x_range, y_range, **kwargs):
        map_tiles = tiles.options(alpha=self.alpha, **opts)
        points = hv.Points(df, [self.location+'_x', self.location+'_y'])
        taxi_trips = datashade(points, x_sampling=0.5, y_sampling=0.5, cmap=self.colormap,
                               dynamic=False, x_range=x_range, y_range=y_range, width=1000, height=600)
        return map_tiles * taxi_trips
explorer = NYCTaxiExplorer(name="NYC Taxi Trips")
parambokeh.Widgets(explorer, callback=explorer.event)
hv.DynamicMap(explorer.make_view, streams=[explorer, RangeXY()])
As you can see, the PyViz tools let you integrate visualization into everything you do, using a small amount of code that reveals your data's properties and captures your understanding of it. The rest of these tutorials will break down each of the topics covered above, showing you step by step how to work with your own data using these tools.
Thanks to all the PyViz contributors, including James A. Bednar, Philipp Rudiger, Jean-Luc Stevens, Bryan Van de Ven, Mateusz Paprocki, Joseph Crail, Greg Brener, and Chris Ball.